(19) Europäisches Patentamt
     European Patent Office
     Office européen des brevets

(11) EP 0 779 609 A2

(12) EUROPEAN PATENT APPLICATION

(43) Date of publication:
     18.06.1997 Bulletin 1997/25

(51) Int. Cl.6: G10L 3/00



(21) Application number: 96119973.4 

(22) Date of filing: 12.12.1996 



(84) Designated Contracting States: 
BE DE FR GB 

(30) Priority: 13.12.1995 JP 324305/95 

(71) Applicant: NEC CORPORATION
Tokyo (JP) 



(72) Inventor: Takagi, Keizaburo, 
c/o NEC Corporation 
Tokyo (JP) 

(74) Representative: Betten & Resch 
Reichenbachstrasse 19 
80469 München (DE)



(54) Speech adaptation system and speech recognizer 

(57) An analyzing unit (1) converts an input speech
into a feature vector time series. A reference pattern
storing unit (3) stores, as reference patterns, feature
vector time series obtained in the same manner as in the
analyzing unit (1). A matching unit (2) correlates the
input speech feature vector time series and the reference
patterns to one another along the time axis. An
environmental adapting unit (4) performs environmental
adaptation between the input speech feature vector time
series and the reference patterns according to the result
of matching in the matching unit (2). A speaker adapting
unit (6) performs adaptation concerning the speaker
between the environmentally adapted reference patterns
from the environmental adapting unit (4) and the input
speech feature vector time series.
Description

BACKGROUND OF THE INVENTION 

The present invention relates to adaptation techniques
in speech recognition and, more particularly, to
techniques for improving the recognition performance by
adapting the differences between an input speech and a
speech reference pattern.

It is well known in the art that the recognition
efficiency of a speech is reduced by differences in
character between the speech and the speech reference
pattern. The differences that significantly reduce the
speech recognition efficiency are largely classified into
two types. In one of these types, the causes are
attributable to the environment in which the speech is
produced by the speaker. In the other type, the causes are
attributable to the speech of the speaker himself or
herself. The environmental causes are further classified
into two categories. One category consists of background
noises or like additive noises, which are introduced
simultaneously with the speaker's speech and additively
affect the speech spectrum. The other category consists of
line distortions, such as microphone or telephone line
transmission characteristics, which distort the spectrum
itself.

Various adaptation methods have been proposed 
to cope with the character differences which are attribut- 
able to the speech environments. One such adaptation 
method aims at coping with the two environmental 
cause categories, i.e., the additive noises and the line 
distortions, to prevent the environmental speech recog- 
nition efficiency reduction. As an example, a speech 
adaptation system used for the speech recognition sys- 
tem is disclosed in Takagi, Hattori and Watanabe, 
"Speech Recognition with Environmental Adaptation 
Function Based on Spectral Copy Images", Spring Pro- 
ceedings of the Acoustics Engineer's Association, 2-P- 
8, pp. 173-174, March 1994 (hereinafter referred to as 
Reference No. 1). 

Fig. 4 shows the speech adaptation system noted 
above. The method disclosed in Reference No. 1 will 
now be described in detail. An input speech which has 
been distorted by additive noises and transmission line 
distortions, is converted in an analyzing unit 41 into a 
time series of feature vectors. A reference pattern stor- 
ing unit 43 stores as a word reference pattern time 
series data of each recognition subject word which is 
obtained by analyzing a training speech in the same 
manner as in the analyzing unit 41. Each word refer- 
ence pattern is given beforehand labels discriminating a 
speech section and a noise section. A matching unit 42 
matches the time series of feature vectors of the input 
speech and the time series of word reference patterns 
in the reference pattern, and selects a first order word 
reference pattern. It also obtains the correlation
between the input speech and the word reference patterns
thereof with respect to the time axis. From the
correlation between the first order word reference
patterns and the input speech feature vectors (pattern)
obtained in the matching unit 42, an environment adapting
unit 44 calculates the mean vectors of the speech and
noise sections of the input speech and each word reference
pattern. The speech and noise section mean vectors of
the input speech are denoted by S_v and N_v, and the
speech and noise section mean vectors of the word
reference patterns are denoted by S_w and N_w. The
environment adapting unit 44 performs the adaptation of
the reference patterns by using the four mean vectors
based on Equation 1 given below. The adapted reference
patterns are stored in an adapted reference pattern
storing unit 45.

$$W'(k) = \frac{S_v - N_v}{S_w - N_w}\,\bigl(W(k) - N_w\bigr) + N_v \qquad (1)$$

where W(k) represents the reference patterns before 
the adaptation (k being an index of all the reference pat- 
terns), and W'(k) represents the adapted reference pat- 
terns. This adaptation permits elimination of the 
environmental difference between the reference pat- 
terns and the input speech and provision of a speech 
adaptation system, which is stable and provides excel- 
lent performance irrespective of input environment vari- 
ations. 
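
The environmental adaptation described above can be sketched in a few lines. The following is only an illustration, assuming Equation 1 has the form W'(k) = (S_v - N_v)/(S_w - N_w) * (W(k) - N_w) + N_v applied independently to each spectral component, which matches the conversion stated in the Summary. The function names `mean_vector` and `adapt_environment` and the small `eps` guard against a zero denominator are assumptions made for this sketch and do not come from Reference No. 1.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(frames: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length frames."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def adapt_environment(w: Vector, s_v: Vector, n_v: Vector,
                      s_w: Vector, n_w: Vector, eps: float = 1e-10) -> Vector:
    """Map one reference spectrum W(k) into the input environment.

    s_v, n_v: speech and noise section means of the input speech.
    s_w, n_w: speech and noise section means of the reference patterns.
    """
    adapted = []
    for wk, sv, nv, sw, nw in zip(w, s_v, n_v, s_w, n_w):
        denom = sw - nw
        if abs(denom) < eps:
            # Assumption: avoid division by zero; the source does not
            # say how this degenerate case is handled.
            denom = eps if denom >= 0 else -eps
        adapted.append((sv - nv) / denom * (wk - nw) + nv)
    return adapted
```

With this form, a reference spectrum equal to N_w is mapped onto N_v and one equal to S_w is mapped onto S_v, so the noise and speech levels of the reference patterns are brought to those of the input.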

A different prior art adaptation technique which is 
commonly termed a speaker adaptation technique, has 
been proposed for the adaptation of the difference with 
respect to the speaker between a reference speaker's 
speech and a recognition subject speaker's speech to 
improve the speech recognition efficiency. This tech- 
nique is disclosed in Shinoda, Iso and Watanabe, 
"Speaker Adaptation with Spectrum Insertion for 
Speech Recognition", Proceedings of the Electronic 
Communication Engineer's Association, A, Vol. J 77-A, 
No. 2, pp. 120-127, February 1994 (hereinafter referred 
to as Reference No. 2). Fig. 5 shows an example of the 
speech adaptation system employed in this technique. 
In the system, an analyzing unit 51 converts an input
speech collected from a speaker having a different
character from the reference speaker into a time series
of feature vectors. A reference pattern storing unit 53
stores, as reference patterns, the time series of the
recognition subject words obtained by analyzing a training
speech of the reference speaker in the same manner as in
the analyzing unit 51. A matching unit 52 matches the
input speech feature vector time series and each word
reference pattern time series stored in the reference
pattern storing unit 53, and selects the first order word
reference patterns. It also obtains the correlation
between the input speech and the word reference patterns
with respect to the time axis. While in this example the
matching unit 52 selects the first order word reference
patterns by itself (speaker adaptation without trainer),
in the case where the first order word reference patterns
are given beforehand (speaker adaptation with trainer),
the matching unit 52 may be constructed such that it
obtains only the correlation between the input speech
and the word reference patterns thereof with respect to
the time axis. A speaker adapting unit 54 performs the
following adaptation for each acoustical unit (a
distribution according to Reference No. 2) on the basis of
the correlation between the first order word reference
patterns obtained in the matching unit 52 and the input
speech feature vectors. The adapted vector Δ_j for each
distribution is obtained as shown below by using the
mean value μ_j of the reference pattern distribution j
stored in the reference pattern storing unit 53 and the
mean value μ_j' of the input correlated to the
distribution j.

$$\Delta_j = \mu_j' - \mu_j \qquad (2)$$

For a distribution i of the reference patterns in the
reference pattern storing unit 53 that has no correlation,
the adaptation is performed by using so-called spectrum
insertion on the basis of the following Equation (3),
which is described in Reference No. 2.

$$\Delta_i = \sum_j W_{ij}\,\Delta_j \qquad (3)$$

where j represents the category of the reference
pattern, in which the acoustical category is present in
the input speech. In effect, all the reference pattern
distributions are adapted with respect to the speaker
according to either one of the two equations noted above.
The adapted reference patterns are outputted from the
speaker adapting unit 54 and stored in an adapted
reference pattern storing unit 55.
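
As an illustration of the two cases above, the sketch below computes an adapted vector for every distribution: a direct difference for distributions that were correlated with the input, and a weighted sum of those differences for the rest. The source does not define the weights W_ij at this point, so the sketch assumes normalized inverse Euclidean distances between the reference means; that choice, like the names `adaptation_vectors` and `euclidean`, is hypothetical and is not taken from Reference No. 2.

```python
import math
from typing import Dict, List

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def adaptation_vectors(ref_means: Dict[str, Vector],
                       input_means: Dict[str, Vector]) -> Dict[str, Vector]:
    """Return an adapted vector for every reference distribution.

    ref_means:   mean of each reference distribution.
    input_means: mean of the input frames correlated to a distribution;
                 distributions with no correlation are simply absent.
    """
    # Distributions seen in the input: difference of the two means.
    deltas = {j: [x - m for x, m in zip(input_means[j], ref_means[j])]
              for j in ref_means if j in input_means}
    if not deltas:
        raise ValueError("no distribution was correlated with the input")

    result = dict(deltas)
    for i, mu_i in ref_means.items():
        if i in deltas:
            continue
        # Assumed weights: inverse distance to each observed
        # distribution, normalized so that they sum to one.
        raw = {j: 1.0 / (euclidean(mu_i, ref_means[j]) + 1e-9) for j in deltas}
        total = sum(raw.values())
        result[i] = [sum(raw[j] / total * deltas[j][d] for j in deltas)
                     for d in range(len(mu_i))]
    return result
```

Under these assumed weights, an unobserved distribution lying midway between two observed ones receives the average of their adapted vectors.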

The prior art speech adaptation system using the 
environmental adaptation as shown in Fig. 4, however, 
aims at the sole adaptation of mean environmental dif- 
ferences appearing in the speech as a whole, and is 
incapable of performing highly accurate adaptation for 
each acoustical unit such as the speaker adaptation. 
Theoretically, therefore, the system can not perform suf- 
ficient adaptation with respect to the speech, which is 
free from environmental differences and involves 
speaker differences alone. 

The prior art speech adaptation system using the
speaker adaptation as shown in Fig. 5 also adapts
differences appearing in the speech as a whole (mainly
due to environmental causes). The result of the
adaptation thus retains both speaker differences and
environmental differences. Where the speech to be
adapted and the speech at the time of the speech
recognition are different in the environment, therefore,
a satisfactory result of adaptation cannot be obtained
because of the environmental differences. A satisfactory
result of adaptation also cannot be obtained where the
speech to be adapted includes speech collected in various
different environments.



SUMMARY OF THE INVENTION 

The present invention seeks to solve the problems
discussed above, and its object is to provide a speech
adaptation system, which can effect highly accurate
adaptation by extracting only environment-independent
speaker differences with high accuracy irrespective of
the environment, in which the speech to be adapted is
collected.

According to the present invention, adaptation with
respect to the speaker is performed after environmental
differences have been removed from the input speech
to be adapted by using environmental adaptation. It is
thus possible to provide a highly accurate speech
adaptation system, which is not affected by the speech
environment of the input speech and solves problems which
could not have been solved with the prior art speaker
adaptation or environmental adaptation alone.

According to one aspect of the present invention,
there is provided a speech adaptation system comprising:
a reference pattern storing unit for storing, as
reference patterns, a time series of feature vectors of a
reference speaker's speech collected from a reference
speaker in a reference speech environment, the time series
being obtained by converting the reference speaker's
speech in a predetermined manner; an analyzing unit for
converting an input speech collected from an input speaker
in an input environment into a time series of feature
vectors of the input speech in the predetermined manner; a
matching unit for time axis correlating the input speech
feature vector time series and the reference patterns to
one another and outputting the result of matching thus
obtained; an environmental adapting unit for adapting
the reference patterns according to the result of
matching into a state, in which differences concerning the
speech environment between the input speech feature
vector time series and the reference patterns are
adapted, and outputting the environmentally adapted
reference patterns thus obtained; and a speaker adapting
unit for adapting the environmentally adapted reference
patterns into a state, in which differences concerning
the speaker between the environmentally adapted reference
patterns and the input speech feature vector time series
are adapted, and outputting the speaker adapted reference
patterns thus obtained.

The feature vectors include cepstrum or spectrum,
and the environmental adapting unit environmentally
adapts the reference patterns by using a difference
concerning the cepstrum or logarithmic spectrum between
the speech section mean spectra of the input speech
and the reference patterns correlated to one another.
The environmental adapting unit environmentally
adapts the reference patterns by converting the
spectrum W(k) of reference patterns k into

$$\frac{(S_v - N_v)\,\bigl(W(k) - N_w\bigr)}{S_w - N_w} + N_v$$

where S_v is the speech section mean spectrum of the
input speech, S_w is the speech section mean spectrum
of the reference patterns, N_v is the noise section mean
spectrum of the input speech, and N_w is the noise
section mean spectrum of the reference patterns, these
mean spectra S_v, S_w, N_v and N_w being obtained from
the feature vectors of the input speech and the reference
patterns correlated to one another.

The speaker adapting unit adapts the environmen- 
tally adapted reference patterns for each acoustical unit 
constituting a part of a word in the reference patterns by 
using, for an acoustical unit with a correlation or at least 
a predetermined number of correlations involved, an 
adapted vector as the difference or ratio between the 
mean feature vectors of the acoustical unit and corre- 
lated input speech and, for an acoustical unit with no 
correlation or correlations less in number than a prede- 
termined number involved, an adapted vector, which is 
obtained through calculation of a weighted sum of the
adapted vectors of the acoustical units with a correlation 
involved by using weights corresponding to the dis- 
tances of the acoustical unit from the acoustical units 
with a correlation involved. 

The speech adaptation system further comprises a 
tree structure reference pattern storing unit, in which 
acoustical units mutually spaced apart by small dis- 
tances in the reference patterns are arranged in a tree 
structure array with hierarchical nodes, the nodes hav- 
ing nodes or acoustical units as children, the lowermost 
ones of the nodes having acoustical units as children, 
the nodes each having a storage of a typical adapted 
vector as the mean adapted vector of all the acoustical 
units with a correlation involved and the sum of the num- 
bers of correlations involved in all the lower acoustical 
units, the speaker adapting unit adapts the environmen- 
tally adapted reference patterns for each acoustical unit 
constituting a part of a word in the reference patterns by 
using, for an acoustical unit with a correlation or at least 
a predetermined number of correlations involved, an 
adapted vector as the difference or ratio between the 
mean feature vectors of the acoustical unit and corre- 
lated input speech and, for an acoustical unit with no 
correlation or correlations less in number than a prede- 
termined number involved, as the adapted vector of the 
acoustical unit a typical adapted vector of the lowest 
nodes among parent nodes of the acoustical units in the 
tree structure reference pattern storing unit, the parent 
nodes being selected from those with at least a
predetermined correlation sum number.

A speech recognition system comprises the speech
adaptation system described above and a recognizing unit
for selecting the speaker adapted reference pattern most
resembling the input speech and outputting the category
to which the selected reference pattern belongs as a
result of recognition.

According to another aspect of the present invention,
there is provided a speech adaptation system comprising:
an analyzing unit for converting an input speech
into a feature vector time series; a reference pattern
storing unit for converting a reference speaker's speech
into a feature vector time series in the same manner as
in the analyzing unit and storing the feature vector time
series thus obtained; a matching unit for time axis
correlating the input speech feature vector time series
and the reference patterns to one another; an
environmental adapting unit for making environmental
adaptation between the input speech feature vector time
series and the reference patterns according to the result
of matching in the matching unit; and a speaker adapting
unit for making adaptation concerning the speaker between
the environmentally adapted reference patterns from the
environmental adapting unit and the input speech feature
vector time series.

Other objects and features will be clarified from the 
following description with reference to attached draw- 
ings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram of the speech adaptation 
system according to one aspect of the present 
invention; 

Fig. 2 is a block diagram of the speech adaptation
system according to another aspect of the present
invention;

Fig. 3 is a block diagram of the speech recognition
system according to the present invention;

Fig. 4 is a block diagram of a prior art speech
adaptation system; and

Fig. 5 is a block diagram of another prior art speech
adaptation system.

PREFERRED EMBODIMENTS OF THE INVENTION 

The functions of a speech adaptation system
according to a first embodiment of the present invention
will now be described with reference to Fig. 1. This
speech adaptation system uses the technique shown in
Reference No. 1 as the environmental adapting unit 4
and the technique shown in Reference No. 2 as the
speaker adapting unit 6. However, it is possible to use
other techniques as well so long as the environmental
adaptation and the speaker adaptation are made. The
analyzing unit 1 converts a noise-containing input
speech into a feature vector time series. Various widely
used feature vector examples are described in, for
instance, Furui, "Digital Speech Processing",
published by Tohkai University, pp. 154-160, 1985
(hereinafter referred to as Reference No. 3). Here, a
case of using spectra obtainable by the LPC analysis,
the FFT analysis, etc. is taken, and the spectrum
derivation procedures are not described. The time series
of the obtained spectrum is labeled X(t) (t being the
discrete time). It is possible to use cepstrum as the
feature vector. However, since the spectrum and the
cepstrum are obviously convertible to each other, only the
case of using the spectrum is described. Generally, it is
difficult to accurately detect the leading and trailing
ends of the speech, which can lead to a consonant at the
leading end of the speech being missed. To avoid this, the
speech analysis is usually started slightly before and
ended slightly after the start and end of the given input
speech. In the reference pattern storing unit 3,
reference patterns obtained as a result of analysis of a
reference speaker's speech in the same manner as in the
analyzing unit 1 are stored beforehand. The matching
unit 2 correlates the input speech feature vector time
series X(t) and the reference patterns to one another.
The environmental adapting unit 4 outputs the mean
vectors of the input speech and the reference patterns
in the speech and noise sections thereof. The mean
vectors of the speech and noise sections of the input
speech are denoted by S_v and N_v, and the mean vectors
of the speech and noise sections of the reference
patterns in the reference pattern storing unit 3 are
denoted by S_w and N_w. The environmental adapting
unit 4 adapts the reference patterns by using the four
mean vectors and also using Equation (4) below. The
environmentally adapted reference patterns are stored
in the environmentally adapted reference pattern
storing unit 5.

$$W'(t) = \frac{S_v - N_v}{S_w - N_w}\,\bigl(W(t) - N_w\bigr) + N_v \qquad (4)$$

where W(t) represents the reference patterns before the
adaptation, and W'(t) represents the adapted reference
patterns. It is well known that the reference patterns
having been environmentally adapted are free from
environmental differences between the input and the
reference patterns and have excellent performance with
respect to environmental variations. The speaker
adapting unit 6 compensates for the differences between
the reference patterns having been environmentally adapted
and the input speech feature vector time series X(t) for
each acoustical unit. Here, the acoustical unit is a
distribution, and the following adaptation is performed
for each distribution.

Using the mean value μ_j of the distribution j of the
environmentally adapted reference patterns and the
mean value X_j' of the input correlated to the
distribution j, the adapted vector Δ_j for each
distribution is obtained as

$$\Delta_j = X_j' - \mu_j \qquad (5)$$

With respect to the distributions i of the
environmentally adapted reference patterns having no
correlation, the adaptation is performed by a process
called spectrum insertion represented by Equation (6)
shown below as described in Reference No. 2.

$$\Delta_i = \sum_j W_{ij}\,\Delta_j \qquad (6)$$

where j represents the category of the reference
patterns, in which an acoustical category is present in
the input speech. In effect, all the distributions of the
reference patterns are speaker adapted on the basis of
either one of the above two equations. The speaker
adapted reference patterns are outputted from the
speaker adapting unit 6 and stored in the speaker
adapted reference pattern storing unit 7. It will be seen
that the speaker adapting unit 6 takes the differences for
each acoustical unit, which could not have been
removed by the environmental adaptation, as differences
attributable to the speaker and effects highly
accurate adaptation for each acoustical unit.

According to the present invention, since the 
speaker adaptation is performed after the environmen- 
tal adaptation, it is possible to provide a speech adapta- 
tion system, which permits highly accurate speaker 
adaptation without being adversely affected by the envi- 
ronments of the input speech. That is, it is possible to 
obtain effects which can not be obtained with the sole 
prior art speech adaptation system. 
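
The ordering argued for above, environmental adaptation first and speaker adaptation on what remains, can be illustrated with a toy sketch. It makes several simplifying assumptions that are not in the source: each acoustical unit is represented by a single mean vector, the speech section means S_w and S_v are taken as plain averages over the correlated units rather than over frames, and the function name `two_stage_adaptation` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Vector = List[float]


def mean_of(vectors: Iterable[Vector]) -> Vector:
    """Component-wise mean of a non-empty collection of vectors."""
    items = list(vectors)
    return [sum(column) / len(items) for column in zip(*items)]


def two_stage_adaptation(
    ref_means: Dict[str, Vector], ref_noise: Vector,
    input_means: Dict[str, Vector], input_noise: Vector,
) -> Tuple[Dict[str, Vector], Dict[str, Vector]]:
    """Return (environmentally adapted means, per-unit residuals)."""
    observed = [j for j in ref_means if j in input_means]
    s_w = mean_of(ref_means[j] for j in observed)    # assumed S_w
    s_v = mean_of(input_means[j] for j in observed)  # assumed S_v

    # Stage 1: remove the global environmental difference.
    # Assumes S_w differs from N_w in every component.
    env = {
        j: [(sv - nv) / (sw - nw) * (m - nw) + nv
            for m, sv, nv, sw, nw
            in zip(mu, s_v, input_noise, s_w, ref_noise)]
        for j, mu in ref_means.items()
    }

    # Stage 2: what is left per unit is attributed to the speaker.
    residual = {j: [x - e for x, e in zip(input_means[j], env[j])]
                for j in observed}
    return env, residual
```

In this toy setting, if the input equals the reference transformed by a gain and a noise offset plus small per-unit offsets that average to zero, the first stage recovers the gain and offset and the residuals equal the per-unit offsets.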

The speech adaptation system shown in Fig. 1 will
now be described in detail. This speech adaptation
system comprises the analyzing unit 1, which converts the
input speech, collected from the input speaker in the
input speech environment, in a predetermined manner into
an input speech feature vector time series. In the
reference pattern storing unit 3, a reference speaker's
speech feature vector time series, which has been obtained
by converting the reference speaker's speech collected
from the reference speaker in a reference speech
environment in the same predetermined manner as in the
analyzing unit 1, is stored as reference patterns. The
matching unit 2 matches the input
speech feature vector time series and the reference pat- 
terns by correlating the two with respect to the time axis, 
and outputs the result of matching. The environmental 
adapting unit 4 adapts the reference patterns according 
to the result of matching to a state, in which the differ- 
ences concerning the speech environment between the 
input speech feature vector time series and the refer- 
ence patterns have been adapted. The environmentally 
adapted reference patterns are stored in the environ- 
mentally adapted reference pattern storing unit 5. The 
speaker adapting unit 6 adapts the environmentally 
adapted reference patterns to a state, in which the dif- 
ferences concerning the speaker between the environ- 
mentally adapted reference patterns and the input 
speech feature vector time series have been adapted. 
The speaker adapted reference patterns are outputted 
and stored in the speaker adapted reference pattern 
storing unit 7. 

The analyzing unit 1 converts the input speech, 
which is collected from an unknown person and con- 
tains noise, into a feature vector time series for match- 
ing. Conceivable as feature vectors which are 
extensively applied, are power data, changes in power 
data, cepstrum, linear regression coefficients of cep- 
strum, etc. It is also possible to combine them into fea- 
ture vectors. As further alternatives, it is possible to use 
the spectrum itself or the logarithmic spectrum. The 
input speech usually contains non-speech portions in 
which only ambient noise is present. The reference pat- 
tern storing unit 3 stores the reference speaker's 
speech as reference patterns through analysis made in 
the same manner as in the analyzing unit 1. The reference
patterns may be those using HMM (Hidden Markov Model) as
described in Reference No. 3, pp. 162-170, vector
quantized codebooks or speech feature vectors themselves.
The matching unit 2 correlates the reference patterns and
the input speech feature vector time series to one
another. For this correlation, time axis normalization
matching such as DP matching or the HMM process may be
used. The environmental adapting unit 4 performs the
adaptation concerning the environment by using the
correlation obtained in the matching unit 2.
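
The text names DP matching as one way to obtain the time axis correlation but does not specify the recurrence. As a hedged illustration, the sketch below uses a textbook dynamic time warping formulation with a squared Euclidean local distance and three symmetric steps; both choices, and the function name `dp_match`, are assumptions rather than details taken from the source.

```python
from typing import List, Sequence, Tuple


def dp_match(inp: Sequence[Sequence[float]],
             ref: Sequence[Sequence[float]]
             ) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two non-empty frame sequences.

    Returns the accumulated distance and the warping path as a list of
    (input frame index, reference frame index) pairs.
    """
    n, m = len(inp), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = sum((a - b) ** 2 for a, b in zip(inp[i - 1], ref[j - 1]))
            cost[i][j] = local + min(cost[i - 1][j],      # input advances
                                     cost[i][j - 1],      # reference advances
                                     cost[i - 1][j - 1])  # both advance

    # Backtrack from the end along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

The returned path is the kind of correlation the adapting units need: it tells which input frames were matched to which reference frame, so section means and per-unit means can be accumulated over the matched frames.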

In a second embodiment of the speech adaptation 
system according to the present invention, a CMN (Cep- 
strum Mean Normalization) process is used for the envi- 
ronmental adapting unit 4. As an example, in a speech 
adaptation system shown in A. E. Rosenberg, et al, 
"Cepstral Channel Normalization Technique for HMM- 
Based Speaker Verification", ICSLP 94, S31, 1, pp. 
1835-1838, 1994 (hereinafter referred to as Reference 
No. 4), cepstrum is used as feature vectors, and only 
speech parts of the input speech are adapted.
Specifically, denoting the feature vectors (cepstrum) of
the speech parts of the input speech by y_t, the mean
value of the feature vectors of the speech parts by y',
and the mean value of the feature vectors of the reference
pattern speech parts by y^(tr)', the adaptation is
performed as

$$y_t \leftarrow y_t - \bigl(y' - y^{(tr)\prime}\bigr) \qquad (7)$$

That is, the cepstrum of the speech parts of the input
speech is shifted by the mean cepstrum difference
between the input speech and the reference pattern
speech parts. It is of course possible instead to shift
y_t^(tr) on the side of the reference patterns for
normalization as

$$y_t^{(tr)} \leftarrow y_t^{(tr)} + \bigl(y' - y^{(tr)\prime}\bigr) \qquad (8)$$

While the above second embodiment of the speech 
adaptation system according to the present invention 
used cepstrum as feature vectors, since the cepstrum 
and the logarithmic spectrum are obviously in a one-to- 
one convertible relation to each other, it is possible to 
use the logarithmic spectrum in substitution for the cep- 
strum. 
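
The mean normalization of the second embodiment can be sketched as follows, assuming the input-side update subtracts the difference between the two speech-part means from every input frame and the reference-side update adds the same difference to every reference frame. The function names are hypothetical, and the frames passed in are assumed to be speech parts only.

```python
from typing import List, Sequence

Frames = List[List[float]]


def mean_cepstrum(frames: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean over a non-empty list of cepstral frames."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def mean_shift(input_frames: Frames, ref_frames: Frames) -> List[float]:
    """Mean cepstrum of the input minus mean cepstrum of the reference."""
    y_mean = mean_cepstrum(input_frames)
    tr_mean = mean_cepstrum(ref_frames)
    return [a - b for a, b in zip(y_mean, tr_mean)]


def cmn_adapt_input(input_frames: Frames, ref_frames: Frames) -> Frames:
    """Shift the input frames so their mean matches the reference mean."""
    shift = mean_shift(input_frames, ref_frames)
    return [[c - s for c, s in zip(frame, shift)] for frame in input_frames]


def cmn_adapt_reference(input_frames: Frames, ref_frames: Frames) -> Frames:
    """Shift the reference frames so their mean matches the input mean."""
    shift = mean_shift(input_frames, ref_frames)
    return [[c + s for c, s in zip(frame, shift)] for frame in ref_frames]
```

Either variant leaves the two sets of frames with the same mean cepstrum, which removes a constant offset in the cepstral domain such as the one a fixed line characteristic introduces.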

In a third embodiment of the speech adaptation system
according to the present invention, the environmental
adapting unit 4 performs the adaptation as in, for
instance, Reference No. 1. Denoting the mean spectra
in the speech and noise sections of the input speech by
S_v and N_v and the mean spectra in the speech and
noise sections in the reference patterns by S_w and N_w,
the environmental adapting unit 4 adapts the reference
patterns according to, for instance, Equation (9) given
below.

$$W'(t) = \frac{S_v - N_v}{S_w - N_w}\,\bigl(W(t) - N_w\bigr) + N_v \qquad (9)$$

where W(t) represents the reference patterns (t being
the index of all the reference patterns) before the
adaptation, and W'(t) represents the adapted reference
patterns. While in this embodiment the reference patterns
are adapted, it is also possible to process the input
speech in the same manner. While the adaptation is
performed on the spectrum, where the feature vectors
are the cepstrum, it can be readily realized by providing
a cepstrum/spectrum converter. In this case, the mean
vectors may be obtained on the cepstrum or after
conversion into the spectrum.

The speaker adapting unit 6 performs the speaker
adaptation using the environmentally adapted reference
patterns from the environmental adapting unit 4.
Generally, various speaker adaptation techniques have
been proposed. Here, a commonly termed spectrum
insertion technique (Reference No. 2) will be described
as a fourth embodiment of the speech adaptation system
according to the present invention. It is possible to
apply other speaker adaptation techniques similarly to
the speaker adapting unit 6. Using the mean value μ_j of
the distributions j of the environmentally adapted
reference patterns and the mean value X_j' of the input
correlated to the distributions j, the adapted vector Δ_j
of each distribution is obtained as

$$\Delta_j = X_j' - \mu_j \qquad (10)$$

The distributions i of the reference patterns in the
environmentally adapted reference pattern storing unit
5 having no correlation are adapted using a process
called spectrum insertion expressed by Equation (11) as
described in Reference No. 2 as

$$\Delta_i = \sum_j W_{ij}\,\Delta_j \qquad (11)$$

where j represents the category of the reference
patterns, in which the acoustical category is present in
the input speech. The speaker adapting unit 6 adapts all
the reference patterns k which belong to the acoustical
category i or j as

$$\mu_k' = \mu_k + \Delta \qquad (12)$$

where Δ is either Δ_i or Δ_j selected in dependence on the
kind of k, μ_k is either μ_i or μ_j selected in dependence
on the kind of k, and μ_k' represents either μ_i' or μ_j'
in dependence on the kind of k. While this embodiment
uses the adapted vectors for greatly adapting the
reference patterns in the storing unit 5, it is possible
to control the degree of adaptation and prevent great
adaptation by using, for instance, an adequate
coefficient α such as

$$\mu_k' = \frac{(1+\alpha)\,\mu_k + \Delta}{1+\alpha} \qquad (13)$$

While in this embodiment the speaker adapting unit 6 
adapts only the reference patterns in the environmen- 
tally adapted reference pattern storing unit 5, it is of 
course possible to process the input speech in the same 
manner.
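
The effect of the coefficient can be seen in a short sketch. It assumes the controlled update has the form μ_k' = {(1 + α) μ_k + Δ} / (1 + α), which reduces to μ_k' = μ_k + Δ when α = 0 and moves the mean by only Δ / (1 + α) otherwise. The function name `apply_adapted_vector` and the restriction α > -1 are assumptions for this illustration.

```python
from typing import List


def apply_adapted_vector(mu_k: List[float], delta: List[float],
                         alpha: float = 0.0) -> List[float]:
    """Shift a reference mean by an adapted vector.

    alpha = 0 applies the full shift; larger alpha applies a smaller
    fraction of it, under the assumed form of the controlled update.
    """
    if alpha <= -1.0:
        raise ValueError("alpha must be greater than -1")
    return [((1.0 + alpha) * m + d) / (1.0 + alpha)
            for m, d in zip(mu_k, delta)]
```

For example, with α = 1 the mean moves by half of the adapted vector, which may be preferable when only a small amount of adaptation speech is available and the estimated vectors are unreliable.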

Fig. 2 is a block diagram showing a fifth embodiment
of the speech adaptation system according to the
present invention. This speech adaptation system
comprises a tree structure reference pattern storing unit
28 in addition to the speech adaptation system shown in
Fig. 1. In the tree structure reference pattern storing
unit 28, acoustical units that are mutually spaced apart
by small distances in the reference patterns are arranged
in a tree structure array with hierarchical nodes. The
nodes have nodes or acoustical units as their children.
The lowermost nodes have acoustical units as their
children. In each node, a typical adapted vector as the
mean adapted vector of all the acoustical units with a
correlation involved and the sum of the numbers of
correlations involved in all the lower acoustical units
are stored. The speaker adapting unit 6 performs
adaptation of the reference patterns for each acoustical
unit (such as a sound element, a syllable, a distribution,
etc.) constituting a part of a word. For an acoustical
unit with a correlation or at least a predetermined number
of correlations involved, the unit 6 performs adaptation
by using the difference or ratio between the mean feature
vectors of that acoustical unit and the input speech
correlated thereto. For an acoustical unit with no
correlation or correlations less in number than a
predetermined number involved, the unit 6 performs the
adaptation by using, as the adapted vector of that
acoustical unit, a typical adapted vector of the lowest
node among the nodes, of which the sum of the numbers of
correlations taking place in all the lower acoustical
units is greater than a predetermined number.

The speaker adapting unit 6 uses the tree structure
reference pattern storing unit 28 for the adaptation. In
the unit 28, all the distributions of the reference patterns
stored in the reference pattern storing unit 3 are
arranged beforehand in a tree structure array using a
method described in Shinoda and Watanabe, "Speaker
Adaptation Using Probability Distributions in a Tree
Structure Array", Spring Proceedings of the Acoustical
Engineers Association, 2-5-10, pp. 49-50, March 1995
(hereinafter referred to as Reference No. 5). In the array,
resembling distributions belong to the same nodes.
Using the mean value $\mu_j$ of the distributions j of the
environmentally adapted reference patterns and the mean
value $\bar{x}_j$ of the input correlated to the distributions j, the
adapted vector $\Delta_j$ is obtained for each distribution as

$\Delta_j = \bar{x}_j - \mu_j$   (14)

For the environmentally adapted reference pattern
distributions i having no correlation or fewer correlations
than a predetermined number involved, the
adaptation is performed by the method described in
Reference No. 5. Specifically, the tree structure is
examined upward from the leaf (i.e., lowermost) nodes,
and the typical adapted vector of the lowest node with at
least a predetermined number of correlations involved
is used as the adapted vector of these distributions i.
The speaker adapted reference patterns are stored in
the speaker adapted reference pattern storing unit 7.
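The upward search through the tree can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of Reference No. 5: the class `Node` and the functions `fill_node` and `adapted_vector` are hypothetical names, the typical adapted vector is taken as the unweighted mean over the lower distributions that have at least one correlation, and the threshold test is "at least `min_count`".

```python
from typing import Dict, List, Optional, Sequence

Vector = List[float]


class Node:
    """Tree node; `units` names the distributions directly below it."""

    def __init__(self, parent: Optional["Node"] = None,
                 units: Sequence[str] = ()) -> None:
        self.parent = parent
        self.children: List["Node"] = []
        self.units = list(units)
        self.count = 0                          # correlations in all lower units
        self.typical: Optional[Vector] = None   # typical adapted vector
        if parent is not None:
            parent.children.append(self)


def units_below(node: Node) -> List[str]:
    """Collect every distribution name under a node."""
    names = list(node.units)
    for child in node.children:
        names.extend(units_below(child))
    return names


def fill_node(node: Node, deltas: Dict[str, Vector],
              counts: Dict[str, int]) -> None:
    """Store the correlation sum and the mean adapted vector in each node."""
    for child in node.children:
        fill_node(child, deltas, counts)
    names = units_below(node)
    node.count = sum(counts.get(n, 0) for n in names)
    observed = [deltas[n] for n in names if counts.get(n, 0) > 0]
    if observed:
        dim = len(observed[0])
        node.typical = [sum(v[d] for v in observed) / len(observed)
                        for d in range(dim)]


def adapted_vector(unit: str, leaf: Node, deltas: Dict[str, Vector],
                   counts: Dict[str, int], min_count: int) -> Optional[Vector]:
    """Use the unit's own vector if it has enough correlations; otherwise
    climb from its leaf node to the lowest node with enough correlations."""
    if counts.get(unit, 0) >= min_count:
        return deltas[unit]
    node: Optional[Node] = leaf
    while node is not None:
        if node.count >= min_count and node.typical is not None:
            return node.typical
        node = node.parent
    return None  # no node has enough data; leave the unit unadapted


if __name__ == "__main__":
    root = Node()
    left = Node(root, units=["a", "b"])
    right = Node(root, units=["c"])
    deltas = {"a": [2.0, 0.0], "c": [4.0, 2.0]}   # from equation (14)
    counts = {"a": 3, "b": 0, "c": 10}
    fill_node(root, deltas, counts)
    print(adapted_vector("b", left, deltas, counts, min_count=5))
    print(adapted_vector("c", right, deltas, counts, min_count=5))
```

In this sketch the distribution "b" has no correlation, and its leaf node holds only 3 correlations, so with a threshold of 5 the search climbs to the root and borrows the root's typical adapted vector.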

Fig. 3 is a block diagram showing a speech recognition
system according to the present invention, which
is constructed by using the first embodiment of the
speech adaptation system according to the present
invention. In this speech recognition system, having the
speech adaptation system according to any one of the
first to fifth embodiments of the present invention, a
recognizing unit 8 performs matching, as in usual speech
recognition, between the reference patterns in the
speaker adapted reference pattern storing unit 7 and
the input speech, and outputs the first-ranked result as
the recognition result.

As has been shown in the foregoing, the speech
adaptation system according to the first embodiment of
the present invention performs the speaker adaptation
after removal of the environmental differences by the
environmental adaptation. It is thus possible to obtain
highly accurate adaptation which could not be
obtained with the environmental adaptation alone as in
the prior art. Besides, in the speaker adaptation only the
environment-independent speaker differences are
extracted, and with high accuracy, so that it is possible to
realize highly accurate adaptation.

With the speech adaptation system according to
the second embodiment of the present invention, in
addition to the effects obtainable with the speech
adaptation system according to the first embodiment of the
present invention, the environmental adaptation can be
obtained with the cepstral differences alone. It is thus
possible to provide a system which incurs smaller
increases in calculation and memory amounts and is
less expensive.

With the speech adaptation system according to 
the third embodiment of the present invention, in addi- 
tion to the effects obtainable with the speech adaptation 
system according to the first embodiment of the
present invention, a higher environmental adaptation 
accuracy can be obtained compared to that of the 
speech adaptation system according to the second 
embodiment of the present invention. It is thus possible 
to realize a more accurate speech adaptation system. 
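For illustration, the spectral conversion used by the third embodiment, {(S_v - N_v)(W(k) - N_w)}/(S_w - N_w) + N_v, can be sketched as below. This is a sketch under assumptions: the conversion is applied independently to each spectral component, the function name `adapt_spectrum` is hypothetical, and leaving a component unchanged when S_w equals N_w is a fallback chosen here, not something the specification prescribes.

```python
from typing import List, Sequence


def adapt_spectrum(w_k: Sequence[float],
                   s_v: Sequence[float], n_v: Sequence[float],
                   s_w: Sequence[float], n_w: Sequence[float],
                   eps: float = 1e-12) -> List[float]:
    """Convert a reference spectrum W(k) toward the input environment.

    s_v, n_v: speech and noise section mean spectra of the input speech.
    s_w, n_w: speech and noise section mean spectra of the reference patterns.
    All arguments are per-component sequences of equal length.
    """
    out: List[float] = []
    for w, sv, nv, sw, nw in zip(w_k, s_v, n_v, s_w, n_w):
        denom = sw - nw
        if abs(denom) < eps:
            # Assumed fallback: no usable speech/noise contrast in the
            # reference, so the component is left as it is.
            out.append(w)
        else:
            out.append((sv - nv) * (w - nw) / denom + nv)
    return out


if __name__ == "__main__":
    print(adapt_spectrum([10.0], s_v=[6.0], n_v=[2.0], s_w=[9.0], n_w=[1.0]))
```

Two sanity checks follow directly from the formula: a reference component equal to S_w maps to S_v, and one equal to N_w maps to N_v, so the reference speech and noise levels are carried onto those of the input.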

With the speech adaptation system according to 
the fourth embodiment of the present invention, in addi- 
tion to the effects obtainable with the speech adaptation 
system according to the first embodiment of the present 
invention, the acoustical units with no correlation
involved can also be adapted with high accuracy. It is thus
possible to realize highly accurate speaker adaptation 
with less data and provide a more accurate speech 
adaptation system. 

With the speech adaptation system according to 
the fifth embodiment of the present invention, in addition 
to the effects obtainable with the speech adaptation sys- 
tem according to the first embodiment of the present 
invention, stable speaker adaptation is obtainable without
parameter control corresponding to the data amount
that is necessary with the speech adaptation system 
according to the fourth embodiment of the present 
invention. It is thus possible to provide a more accurate 
speech adaptation system. 

As a consequence, it is possible to provide a highly
accurate speech recognition system, which enjoys the 
effects obtainable with the speech adaptation systems 
according to the first to fifth embodiments of the present 
invention. 

Changes in construction will occur to those skilled 
in the art and various apparently different modifications 
and embodiments may be made without departing from 
the scope of the present invention. The matter set forth 
in the foregoing description and accompanying draw- 
ings is offered by way of illustration only. It is therefore 
intended that the foregoing description be regarded as 
illustrative rather than limiting. 

Claims 

1. A speech adaptation system comprising:

a reference pattern storing unit for storing a 
time series of feature vectors of a reference 
speaker's speech collected from a reference 
speaker in a reference speech environment, 
the time series being obtained by converting 
the reference speaker's speech in a predeter- 
mined manner; 

an analyzing unit for converting an input 
speech collected from an input speaker in an 
input environment into a time series of feature 
vectors of the input speech in the predeter- 
mined manner; 

a matching unit for time axis correlating the
input speech feature vector time series and the
reference patterns to one another and outputting
the result of matching thus obtained;

an environmental adapting unit for adapting the
reference patterns according to the result of
matching into a state, in which differences
concerning the speech environment between the
input speech feature vector time series and the
reference patterns are adapted, and outputting
the environmentally adapted reference patterns
thus obtained; and

a speaker adapting unit for adapting the
environmentally adapted reference patterns into a
state, in which differences concerning the
speaker between the environmentally adapted
reference patterns and the input speech feature
vector time series are adapted, and outputting the
speaker adapted reference patterns thus
obtained.

2. The speech adaptation system according to claim
1, wherein the feature vectors include cepstrum or
spectrum, and the environmental adapting unit
environmentally adapts the reference patterns by
using a difference concerning the cepstrum or
logarithmic spectrum between the speech section
mean spectra of the input speech and the reference
patterns correlated to one another.

3. The speech adaptation system according to claim 1,
wherein the environmental adapting unit
environmentally adapts the reference patterns by
converting the spectrum W(k) of reference patterns k into

$\{(S_v - N_v)(W(k) - N_w)\}/(S_w - N_w) + N_v$

where $S_v$ is the speech section mean spectrum of
the input speech, $S_w$ is the speech section mean
spectrum of the reference patterns, $N_v$ is the noise
section mean spectrum of the input speech, and
$N_w$ is the noise section mean spectrum of the
reference patterns, these mean spectra $S_v$, $S_w$, $N_v$ and
$N_w$ being obtained between the feature vectors of
the input speech and the reference patterns.

4. The speech adaptation system according to claim
1, wherein the speaker adapting unit adapts the
environmentally adapted reference patterns for
each acoustical unit constituting a part of a word in
the reference patterns by using, for an acoustical
unit with a correlation or at least a predetermined
number of correlations involved, an adapted vector
as the difference or ratio between the mean feature
vectors of the acoustical unit and correlated input
speech and, for an acoustical unit with no
correlation or correlations less in number than a
predetermined number involved, an adapted vector, which is
obtained through calculation of a weighted sum of
the adapted vectors of the acoustical units with a
correlation involved by using weights corresponding
to the distances of the acoustical unit from the
acoustical units with a correlation involved.

5. The speech adaptation system according to claim
1, which further comprises a tree structure
reference pattern storing unit, in which acoustical units
mutually spaced apart by small distances in the
reference patterns are arranged in a tree structure
array with hierarchical nodes, the nodes having
nodes or acoustical units as children, the lowermost
ones of the nodes having acoustical units as
children, the nodes each having a storage of a typical
adapted vector as the mean adapted vector of all
the acoustical units with a correlation involved and
the sum of the numbers of correlations involved in
all the lower acoustical units, wherein

the speaker adapting unit adapts the
environmentally adapted reference patterns for each
acoustical unit constituting a part of a word in the
reference patterns by using, for an acoustical unit
with a correlation or at least a predetermined
number of correlations involved, an adapted vector
as the difference or ratio between the mean feature
vectors of the acoustical unit and correlated input
speech and, for an acoustical unit with no
correlation or correlations less in number than a
predetermined number involved, as the adapted vector of
the acoustical unit a typical adapted vector of the
lowest nodes among parent nodes of the acoustical
units in the tree structure reference pattern storing
unit, the parent nodes being selected from those
with at least a predetermined correlation sum
number.

6. A speech recognition system comprising the 
speech adaptation system according to any one
of claims 1 to 5, and a recognizing unit for selecting 
a speaker adapted reference pattern most resem- 
bling the input speech and outputting the category 
to which the selected reference pattern belongs as 
a result of recognition. 

7. A speech adaptation system comprising:

an analyzing unit for converting an input
speech into a feature vector time series;
a reference pattern storing unit for converting a
reference speaker's speech into a feature vector
time series in the same manner as in the
analyzing unit and storing the feature vector
time series thus obtained;
a matching unit for time axis correlating the
input speech feature vector time series and the
reference patterns to one another;
an environmental adapting unit for making
environmental adaptation between the input
speech feature vector time series and the
reference patterns according to the result of
matching in the matching unit; and
a speaker adapting unit for making adaptation
concerning the speaker between the
environmentally adapted reference patterns from the
environmental adapting unit and the input
speech feature vector time series.